Abstract

This article provides analytic evaluations of expected (marginal) true-score measures for binary 
items given their IRT calibration. Under the assumption of normal trait distribution, marginalized 
true scores, error variance, true score variance, and reliability for norm-referenced and criterion- 
referenced interpretations are presented as a function of the item parameters. The proposed 
formulas have methodological and computational value in bridging concepts of IRT and true 
score theory. They provide information about the individual contribution of IRT calibrated items 
to marginal true-score measures for the test and may have valuable applications in test 
development and analysis. For example, given a bank of IRT calibrated items, one can select 
binary items to develop a test with known true-score characteristics prior to administering the test 
(without information about raw scores or trait scores.) Calculations with the proposed formulas 
are easy to perform using basic statistical programs, spreadsheet programs, or even hand-held 
calculators. 

Index terms: true score theory, item response theory, reliability.

Expected Values and Reliability of Number-Right
Scores for IRT Calibrated Items 

True-score measures and reliability are used in substantive and measurement studies even
when item response theory (IRT) information about items and persons is available (e.g., with
standardized tests). Traditionally, such measures represent a common focal point between test
developers and practitioners as they place the scores and their accuracy in the original scale of
measurement [e.g., number-right (NR) score]. True (or domain) scores are readily interpretable
and, for example, when pass-fail decisions are made, a cutting score is typically set on the
domain-score scale (e.g., Hambleton, Swaminathan, and Rogers, 1991, p. 85). Therefore, it seems
appropriate to argue that IRT estimates and classical estimates of scores and their reliability are
not mutually exclusive and may coexist in making adequate interpretations and decisions based on
test data. Combining IRT information about trait scores with readily interpretable true-score
information can improve the quality of test development and analysis. This, however,
requires better understanding of the relationships between IRT and classical concepts from
methodological and technical perspectives. As a step in this direction, this article investigates
relationships between marginal true-score measures and IRT parameters of binary items. Analytic
expressions (formulas) of such relationships can be useful in test development and analysis from
both methodological and technical perspectives.

Before presenting the theoretical framework for bridging true-score measures to IRT item
parameters, an important clarification should be made. As is known, the accuracy of measurement
in IRT varies across the levels of a latent trait, $\theta$, that underlies the persons' responses on each
item. The IRT conditional error variance at $\theta$, $\sigma^2_{\hat{\theta}|\theta}$, inversely related to the information provided
by the test at $\theta$ (Birnbaum, 1968), is not to be confused with the conditional raw-score variance at
$\theta$, $\sigma^2_{X|\theta}$. The expected value of the latter (when $\theta$ varies from $-\infty$ to $\infty$) is the error variance for the
raw score (e.g., Lord, 1980), whereas the expected value of the former is referred to as marginal
measurement error variance (Green, Bock, Humphreys, Linn, & Reckase, 1984). The marginal
reliability in IRT is used, for example, as an overall index of precision in computerized adaptive
testing for comparison with the classical internal-consistency reliability estimated for
paper-and-pencil forms (Green et al.; Thissen, 1990). Such comparisons, however, require more accurate
evaluations of the population reliability for paper-and-pencil forms than those provided by
sample-based empirical indexes such as Cronbach's coefficient alpha (Cronbach, 1951). Some additional
comments on this issue are provided in the discussion section.

The formulas proposed in this article, derived under the assumption of normal trait
distribution, can be very useful in test development and analysis. For example, given a bank of
IRT calibrated items, one can select items to develop a test (e.g., for follow-up measurements in
longitudinal studies) with true-score characteristics and reliability known prior to data collection.



Theoretical Framework

Let $P_i(\theta)$ be the probability for correct response on item $i$ for a person with a trait score $\theta$
under an appropriate IRT model: one-parameter (1PL), two-parameter (2PL), or three-parameter
(3PL) logistic model (Birnbaum, 1968). As $P_i(\theta)$ is the item true score at $\theta$, the expected marginal
number-right (NR) score for a test of $n$ binary items is

$$\mu = \sum_{i=1}^{n} \int_{-\infty}^{\infty} P_i(\theta)\,\varphi(\theta)\,d\theta, \tag{1}$$

where $\varphi(\theta)$ is the probability density function (pdf) for the trait distribution. The integration is
from $-\infty$ to $\infty$ since the ability, $\theta$, is not limited in the theoretical framework of IRT. Also, as the
product $P_i(\theta)[1 - P_i(\theta)]$ is the conditional error variance for item $i$ at $\theta$ (Lord, 1980, p. 45), the
expected error variance for the NR score on a test of $n$ binary items is

$$\sigma_e^2 = \sum_{i=1}^{n} \int_{-\infty}^{\infty} P_i(\theta)[1 - P_i(\theta)]\,\varphi(\theta)\,d\theta. \tag{2}$$

The true score variance for the NR score is usually presented (e.g., May & Nicewander, 1993) as

$$\sigma_\tau^2 = \int_{-\infty}^{\infty} [n\bar{P}(\theta)]^2\,\varphi(\theta)\,d\theta - \left[\int_{-\infty}^{\infty} n\bar{P}(\theta)\,\varphi(\theta)\,d\theta\right]^2, \tag{3}$$

where $\bar{P}(\theta)$ is the mean of $P_i(\theta)$ at $\theta$ ($i = 1, \ldots, n$).



Previous research provides limited applications of Equations 1, 2, or 3 using, for example,
Gaussian quadrature (Bock & Lieberman, 1970), but analytic solutions are not provided. For
example, comparing reliability for NR scores and percentile ranks, May and Nicewander (1993)
evaluated the integrals in Equations 2 and 3 using Simpson's rule with 100 points on the $\theta$
interval from -5 to 5 after approximating the compound binomial distributions of raw scores. This
article takes a different approach and provides analytic solutions (formulas) for marginalized
true-score measures at item level, thus making it possible to determine (and control) the contribution of
individual items to the values of $\mu$, $\sigma_e^2$, $\sigma_\tau^2$, and reliability indexes at test level. Comments on the
advantages of the proposed analytic solutions over direct brute-force quadrature integrations are
provided in the discussion section.
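
For orientation, the brute-force route mentioned above can be sketched in a few lines. The sketch below is not the procedure of any cited study; it evaluates Equations 1 and 2 for 2PL items with a composite Simpson's rule under a standard normal trait distribution. The function names, the integration limits (-8 to 8), and the number of intervals (400) are assumptions chosen for illustration.

```python
import math

D = 1.7  # scaling constant used with the logistic models in this article


def p_2pl(theta, a, b):
    """2PL probability of a correct response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def normal_pdf(theta):
    """Standard normal density, phi(theta)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def simpson(f, lo, hi, n_intervals):
    """Composite Simpson's rule on [lo, hi]; n_intervals must be even."""
    h = (hi - lo) / n_intervals
    total = f(lo) + f(hi)
    for k in range(1, n_intervals):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


def marginal_measures(items, lo=-8.0, hi=8.0, n_intervals=400):
    """Numerically evaluate Equations 1 and 2 for theta ~ N(0, 1).

    items is a list of (a, b) pairs; returns (mu, error_variance).
    The limits and grid size are illustrative assumptions.
    """
    mu = 0.0
    error_variance = 0.0
    for a, b in items:
        mu += simpson(lambda t: p_2pl(t, a, b) * normal_pdf(t), lo, hi, n_intervals)
        error_variance += simpson(
            lambda t: p_2pl(t, a, b) * (1.0 - p_2pl(t, a, b)) * normal_pdf(t),
            lo, hi, n_intervals,
        )
    return mu, error_variance


# Two hypothetical items with difficulties symmetric around zero.
mu, err_var = marginal_measures([(1.0, -0.5), (1.0, 0.5)])
print(round(mu, 4), round(err_var, 4))
```

Such a routine is useful mainly as a numerical cross-check on the analytic formulas developed in the following sections.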

Given the IRT calibration of binary items, marginalized true-score measures for a normal
trait distribution are evaluated in this article at both item and test level. For individual items,
formulas are provided for the item score ($\pi_i$), item error variance [$\sigma^2(e_i)$], item true variance
[$\sigma^2(\tau_i)$], and item reliability ($\rho_{ii}$). At test level, formulas are provided for the population mean of
NR scores ($\mu$), domain score ($\pi$), error variance ($\sigma_e^2$), true score variance ($\sigma_\tau^2$), reliability ($\rho_{xx}$),
and dependability index [$\Phi(\lambda)$] for criterion-referenced interpretations based on a cutting domain
score, $\lambda$. For items calibrated with the 2PLM, $\pi_i$ and $\sigma^2(e_i)$ are evaluated through approximation
formulas (with a negligible approximation error). All other true-score measures at item and test
level are represented (explicitly or implicitly) as exact analytic functions of $\pi_i$ and $\sigma^2(e_i)$. The next
sections provide formulas for binary items calibrated with the 2PLM, 3PLM, and 1PLM and two
illustrative examples. The mathematical derivations of the formulas are given in Appendix A. The
calculations with the proposed formulas are facilitated by the use of SPSS syntax (SPSS, Inc.,
2002) provided in Appendix B.

Formulas for Binary Items Calibrated with the 2PLM

With the 2PLM, the probability of a correct answer on a binary item $i$ for a person with a
trait level $\theta$ is determined with

$$P_i(\theta) = \frac{\exp[D a_i(\theta - b_i)]}{1 + \exp[D a_i(\theta - b_i)]}, \tag{4}$$

where $a_i$ is the item discrimination, $b_i$ is the item difficulty, and $D$ is a scaling factor (with $D$ =
1.7, the normal-ogive and logistic item-characteristic functions are almost identical).

Item Score.

The marginal probability of correct responses on item $i$ is referred to here as item score, $\pi_i$.
In classical test theory, the empirical estimate of $\pi_i$ is referred to as item difficulty (although it is,
in fact, the easiness of the item). As proven in Appendix A, $\pi_i$ can be represented as a function
of the IRT item parameters ($a_i$ and $b_i$) as

$$\pi_i = \frac{1 - \operatorname{erf}(X_i)}{2}, \tag{5}$$

where $X_i = a_i b_i / \sqrt{2(1 + a_i^2)}$ and erf is a known mathematics function called the error function.
With an approximation provided by Hastings (1955, p. 185), the error function (for $X > 0$) can
be evaluated (with an absolute error smaller than 0.0005) as

$$\operatorname{erf}(X) = 1 - (1 + a_1 X + a_2 X^2 + a_3 X^3 + a_4 X^4)^{-4},$$

where $a_1$ = .278393, $a_2$ = .230389, $a_3$ = .000972, and $a_4$ = .078108. When $X < 0$, one can use that
$\operatorname{erf}(-X) = -\operatorname{erf}(X)$. It should be also noted that erf($X$) is directly executable with computer
programs for mathematics (e.g., MATLAB 5.3; MathWorks, Inc., 1999). Figure 1 represents the
values of $\pi_i$ (calculated with Formula 5) as a function of the item parameters $a_i$ and $b_i$.
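
As a minimal sketch of how Formula 5 might be computed, the Python fragment below evaluates the item score both with the library error function and with the Hastings approximation quoted above. The function names are assumptions made for illustration; only the formula and the four coefficients come from the text.

```python
import math

# Hastings (1955) coefficients as quoted in the text.
A1, A2, A3, A4 = 0.278393, 0.230389, 0.000972, 0.078108


def erf_hastings(x):
    """Approximate erf(x); uses erf(-x) = -erf(x) for negative arguments."""
    z = abs(x)
    poly = 1.0 + A1 * z + A2 * z**2 + A3 * z**3 + A4 * z**4
    value = 1.0 - poly ** -4
    return value if x >= 0 else -value


def item_score_2pl(a, b, erf=math.erf):
    """Formula 5: marginal item score for a 2PLM item under theta ~ N(0, 1)."""
    x = a * b / math.sqrt(2.0 * (1.0 + a * a))
    return (1.0 - erf(x)) / 2.0


# Hypothetical item parameters, evaluated with both versions of erf.
exact = item_score_2pl(1.2, 0.5)
approx = item_score_2pl(1.2, 0.5, erf=erf_hastings)
print(round(exact, 4), round(approx, 4))
```

Passing the error function as an argument keeps the two variants comparable; with a spreadsheet or hand calculator only the Hastings form would be needed.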

Item Error Variance.

As one can see from Equation 2, the marginal error variance for an item $i$ can be obtained
through the evaluation of the integral

$$\sigma^2(e_i) = \int_{-\infty}^{\infty} P_i(\theta)[1 - P_i(\theta)]\,\varphi(\theta)\,d\theta. \tag{7}$$

With $\varphi(\theta)$ for the standard normal distribution and $D$ = 1.7 with the 2PLM, Equation 7 becomes

$$\sigma^2(e_i) = \int_{-\infty}^{\infty} \frac{\exp[1.7a_i(\theta - b_i)]}{(1 + \exp[1.7a_i(\theta - b_i)])^2}\,\frac{1}{\sqrt{2\pi}}\exp(-.5\theta^2)\,d\theta. \tag{8}$$



Since a closed form evaluation of the integral in Equation 8 does not exist, an approximation was
developed in two steps. First, using the computer program MATLAB 5.3 (MathWorks, Inc.,
1999), quadrature method evaluations were obtained for practically occurring values from 0 to 3
for the item discrimination, $a_i$, and from -6 to 6 for the item difficulty, $b_i$, with a step of 0.01 on
the logit scale. Second, the results were tabulated and approximated using the three-parameter
Gaussian function with the regression wizard of the computer program SigmaPlot 5.0 (SPSS Inc.,
1998). The resulting approximation formula is

$$\sigma^2(e_i) = m_i \exp[-0.5(b_i/d_i)^2], \tag{9}$$

where $b_i$ is the item difficulty, whereas $m_i$ and $d_i$ depend on the item discrimination ($a_i$):

$$m_i = 0.2646 - 0.118a_i + 0.0187a_i^2;$$

$$d_i = 0.7427 + 0.7081/a_i + 0.0074/a_i^2.$$

Depending on the values of $a_i$ and $b_i$, the error of approximation with Formula 9 varies from 0 to
0.005 in absolute value (with a mean of 0.001 and a standard deviation of 0.001). As one can see
from Formula 9 (graphical illustration in Figure 2), the item error variance is an even function of
$b_i$ for fixed values of $a_i$. In other words, the value of $\sigma^2(e_i)$ is the same for $b_i$ and $-b_i$ when the
value of $a_i$ is fixed. As Figure 2 also shows, larger errors occur with average difficulty items and
smaller errors occur with easy or difficult items. It should be noted also that $\sigma^2(e_i)$ represents an
additive error variance component of the (total) error variance for the NR score, $\sigma_e^2$.
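
A short sketch of Formula 9 follows. The function name and the variable names `m` and `d` for the two discrimination-dependent coefficients are labels assumed here for illustration; the numerical constants are those given in the text, and the discrimination is assumed to be positive because it appears in a denominator.

```python
import math


def item_error_variance_2pl(a, b):
    """Formula 9: approximate marginal error variance of a 2PLM item (a > 0)."""
    m = 0.2646 - 0.118 * a + 0.0187 * a**2   # height of the Gaussian curve
    d = 0.7427 + 0.7081 / a + 0.0074 / a**2  # width of the Gaussian curve
    return m * math.exp(-0.5 * (b / d) ** 2)


# The value depends on b only through b squared, so -2.0 and 2.0 agree,
# and the largest value occurs at b = 0.
for b in (-2.0, 0.0, 2.0):
    print(b, round(item_error_variance_2pl(1.0, b), 4))
```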




Figure 1. Marginal item score for binary items as a function of their discrimination ($a_i$) and
difficulty ($b_i$) parameters with the 2PLM.

Figure 2. Error variance for binary items as a function of their discrimination ($a_i$) and
difficulty ($b_i$) parameters with the 2PLM.



Item True Variance.

As proven in Appendix A, the item true variance can be represented as an exact function
of the item score and item error variance:

$$\sigma^2(\tau_i) = \pi_i(1 - \pi_i) - \sigma^2(e_i). \tag{10}$$

It should be noted also that the derivation of Formula 10 is the same with any IRT model (1PLM,
2PLM, or 3PLM) and any (not necessarily normal) trait distribution (see Appendix A).

Item Reliability

In classical test theory, the reliability of item $i$ is empirically estimated with the product
$s_i r_{iX}$, where $s_i$ is the item-score standard deviation and $r_{iX}$ is the point-biserial correlation between
the item score and total test score (e.g., Allen & Yen, 1979, p. 124). This article uses the ratio
"item true variance to observed item variance" for the evaluation of item reliability ($\rho_{ii}$). Thus,
given the IRT calibration of binary items, the marginal reliability of an item can be evaluated with

$$\rho_{ii} = \frac{\sigma^2(\tau_i)}{\sigma^2(\tau_i) + \sigma^2(e_i)}, \tag{11}$$

where $\sigma^2(e_i)$ and $\sigma^2(\tau_i)$ are obtained with Formulas 9 and 10, respectively. Information about the
reliability of individual items can be particularly useful when the purpose is to select items that
maximize the internal consistency reliability of test scores (e.g., Allen & Yen, 1979, p. 125).
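
Formulas 10 and 11 reduce to two lines of arithmetic once the item score and item error variance are known. The sketch below uses hypothetical function names; the inputs 0.5 and 0.1653 are illustrative values corresponding to an item with $a_i$ = 1 and $b_i$ = 0 under Formulas 5 and 9.

```python
def item_true_variance(pi, error_variance):
    """Formula 10: item true variance from item score and item error variance."""
    return pi * (1.0 - pi) - error_variance


def item_reliability(pi, error_variance):
    """Formula 11: true variance over observed (true + error) variance."""
    true_variance = item_true_variance(pi, error_variance)
    return true_variance / (true_variance + error_variance)


# Illustrative inputs: item score 0.5 and item error variance 0.1653.
pi_i, err_i = 0.5, 0.1653
print(round(item_true_variance(pi_i, err_i), 4), round(item_reliability(pi_i, err_i), 4))
```

Note that the denominator of Formula 11 equals $\pi_i(1 - \pi_i)$, the variance of a binary item with marginal score $\pi_i$.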



Marginal NR Score.

Given the item score, $\pi_i$, of each item in a test of $n$ binary items, the marginal NR score is

$$\mu = \sum_{i=1}^{n} \pi_i. \tag{12}$$

Error Variance for the NR Score.

Given the item error variance, $\sigma^2(e_i)$, for each item in a test of $n$ binary items, the marginal
error variance is

$$\sigma_e^2 = \sum_{i=1}^{n} \sigma^2(e_i). \tag{13}$$



True Score Variance for the NR Score.

As proven in Appendix A, the marginal true score variance for a test of $n$ binary items is

$$\sigma_\tau^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} \sqrt{[\pi_i(1 - \pi_i) - \sigma^2(e_i)][\pi_j(1 - \pi_j) - \sigma^2(e_j)]}, \tag{14}$$

where $\pi_i$ and $\sigma^2(e_i)$ [or $\pi_j$ and $\sigma^2(e_j)$] are obtained with Formulas 5 and 7, respectively.

Reliability.

Under the true-score model (Lord & Novick, 1968), the reliability is

$$\rho_{xx} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_e^2}. \tag{15}$$

In this article, the theoretical value of $\rho_{xx}$ is evaluated by replacing $\sigma_e^2$ and $\sigma_\tau^2$ in Formula 15 with
their evaluations obtained through Formulas 13 and 14, respectively.
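
The test-level measures can be assembled from item-level values as sketched below. The sketch assumes that Formula 14 takes the congeneric form in which the true score variance equals the squared sum of the item true-score standard deviations, with item true variances from Formula 10. The function name and the clipping of negative item true variances at zero are additions made here for illustration, not part of the article.

```python
import math


def nr_score_measures(item_scores, item_error_variances):
    """Return (mu, error_var, true_var, reliability) for the NR score."""
    mu = sum(item_scores)                          # Formula 12
    error_var = sum(item_error_variances)          # Formula 13
    # Item true variances from Formula 10, clipped at 0 to guard against
    # tiny negative values caused by approximation error in the inputs.
    true_sds = [
        math.sqrt(max(p * (1.0 - p) - e, 0.0))
        for p, e in zip(item_scores, item_error_variances)
    ]
    true_var = sum(true_sds) ** 2                  # assumed form of Formula 14
    reliability = true_var / (true_var + error_var)  # Formula 15
    return mu, error_var, true_var, reliability


# Two identical hypothetical items (item score 0.5, error variance 0.1653).
print([round(v, 4) for v in nr_score_measures([0.5, 0.5], [0.1653, 0.1653])])
```

For identical items this computation reproduces the Spearman-Brown value obtained from the item reliability, which is a convenient sanity check on the assumed form.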



Dependability Index.

Brennan and Kane (1977) introduced a dependability index, $\Phi(\lambda)$, for criterion-referenced
interpretations in the framework of generalizability theory (GT; e.g., Brennan, 1983):

$$\Phi(\lambda) = \frac{\sigma^2(p) + (\pi - \lambda)^2}{\sigma^2(p) + (\pi - \lambda)^2 + \sigma^2(\Delta)}, \tag{16}$$

where $\sigma^2(p)$ is the universe-score variance for persons, $\sigma^2(\Delta)$ is the absolute error variance, $\pi$ is
the population mean, and $\lambda$ is the cutting score ($\pi$ and $\lambda$ are in the metric of proportion of items
correct). When $\pi = \lambda$, the index $\Phi(\lambda)$ reaches its lower limit referred to also as index $\Phi$ in GT. As
Feldt and Brennan (1993) note, "the index $\Phi(\lambda)$ characterizes the dependability of decisions based
on the testing procedure, whereas the index $\Phi$ characterizes the contribution of the testing
procedure to the dependability of such decisions" (p. 141). With the "person x item" ($p \times i$) design
in GT, the absolute error variance is $\sigma^2(\Delta) = \sigma^2(pi,e)/n + \sigma^2(i)/n$.

As the parameters in Formula 16 are in the metric of proportion of items correct, their
translation in the framework of this article is (a) $\sigma^2(p) = \sigma_\tau^2/n^2$, where $\sigma_\tau^2$ is the true variance for
the NR score, (b) $\sigma^2(\Delta) = \sigma_e^2/n^2 + \sigma^2(\pi_i)/n$, where $\sigma_e^2$ is the error variance for the NR score and
$\sigma^2(\pi_i)$ is the variance of $\pi_i$ values ($i = 1, \ldots, n$), (c) $\sigma^2(i) = \sigma^2(\pi_i)$, and (d) $\sigma^2(pi,e) = \sigma_e^2/n$. With
this, the dependability index $\Phi(\lambda)$ translates into

$$\Phi(\lambda) = \frac{\sigma_\tau^2 + n^2(\pi - \lambda)^2}{\sigma_\tau^2 + n^2(\pi - \lambda)^2 + \sigma_e^2 + n\sigma^2(\pi_i)}. \tag{17}$$

Index $\Phi(\lambda)$ achieves its lowest limit when $\pi = \lambda$. The resulting dependability index is

$$\Phi = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_e^2 + n\sigma^2(\pi_i)}. \tag{18}$$



The comparison of Formulas 15 and 18 shows that the dependability index $\Phi$ does not exceed the
reliability coefficient $\rho_{xx}$. Intuitively this also makes sense because, as Feldt and Brennan (1993)
note, "criterion-referenced interpretations of 'absolute' scores are more stringent than
norm-referenced interpretations of 'relative' scores ... $\Phi$ can also be interpreted as a chance-corrected
index of dependability for criterion-referenced interpretations with squared-error loss" (p. 141). It
should be stressed that, while the evaluation of $\rho_{xx}$, $\Phi(\lambda)$, and $\Phi$ in the framework of GT requires
sample data (e.g., binary scores), Formulas 15, 17, and 18 in the framework of this article do not
require such data as long as the IRT item parameters are available.
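
A minimal sketch of Formulas 17 and 18 is given below. The function and argument names are assumptions, and the numerical inputs are hypothetical values chosen only to show the behavior of the index; they are not results from this article.

```python
def dependability(true_var, error_var, n_items, item_score_var, domain_score, cut_score):
    """Formula 17: Phi(lambda), with variances on the number-right metric."""
    shift = n_items**2 * (domain_score - cut_score) ** 2
    numerator = true_var + shift
    return numerator / (numerator + error_var + n_items * item_score_var)


def phi_lower_limit(true_var, error_var, n_items, item_score_var):
    """Formula 18: the value of Formula 17 when the cut score equals the domain score."""
    return true_var / (true_var + error_var + n_items * item_score_var)


# Hypothetical test: 20 items, true variance 5.0, error variance 3.0,
# variance of item scores 0.04, domain score 0.5.
phi = phi_lower_limit(5.0, 3.0, 20, 0.04)
phi_at_cut = dependability(5.0, 3.0, 20, 0.04, domain_score=0.5, cut_score=0.7)
rho = 5.0 / (5.0 + 3.0)  # Formula 15 for the same variances
print(round(phi, 4), round(phi_at_cut, 4), round(rho, 4))
```

Because the term $n\sigma^2(\pi_i)$ in the denominator is never negative, the lower limit cannot exceed the reliability computed from the same variance components, and the index grows as the cut score moves away from the domain score.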

Formulas for Binary Items Calibrated with the 3PLM

With the 3PLM (Birnbaum, 1968), the probability for correct response on item $i$ for a
person with a trait score $\theta$ [denoted here as $P_i^*(\theta)$] is provided with

$$P_i^*(\theta) = c_i + (1 - c_i)/\{1 + \exp[-1.7a_i(\theta - b_i)]\}, \tag{19}$$

where $c_i$ is the pseudo-chance level ("guessing") parameter of the model. In order to distinguish
true-score measures for items calibrated with the 2PLM from their counterparts with the 3PLM,
we star the latter (e.g., $\pi_i^*$). Clearly, Equation 19 can be written as

$$P_i^*(\theta) = c_i + (1 - c_i)P_i(\theta), \tag{20}$$

where $P_i(\theta)$ is with the 2PLM (see Equation 4).

Item Score.

The item score for calibrations with the 3PLM is

$$\pi_i^* = c_i + (1 - c_i)\pi_i, \tag{21}$$

where $\pi_i$ is obtained through Formula 5 for calibrations with the 2PLM. The proof follows directly
from multiplying both sides of Equation 20 by $\varphi(\theta)$ and integrating each side from $-\infty$ to $\infty$.

Item Error Variance.

The item error variance for calibrations with the 3PLM is

$$\sigma^2(e_i^*) = c_i(1 - c_i)(1 - \pi_i) + (1 - c_i)^2\sigma^2(e_i), \tag{22}$$

where $\pi_i$ and $\sigma^2(e_i)$ are obtained through Formulas 5 and 9, respectively, for calibrations with the
2PLM (proof in Appendix A). Figure 3 graphically represents values of the item error variance
(calculated with Formula 22) as a function of the item parameters $a_i$ and $b_i$ for a fixed value of the
pseudo-chance level parameter ($c_i$ = 0.2).

Item True Variance.

The item true variance for calibrations with the 3PLM is

$$\sigma^2(\tau_i^*) = \pi_i^*(1 - \pi_i^*) - \sigma^2(e_i^*), \tag{23}$$

where $\pi_i^*$ and $\sigma^2(e_i^*)$ are obtained with Formulas 21 and 22, respectively. Formula 23 follows
directly from Formula 10 because the derivation of the latter does not depend on which model is
used for item calibration (1PLM, 2PLM, or 3PLM).
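
Formulas 21 through 23 can be chained as in the following sketch, which maps 2PLM item measures to their 3PLM counterparts. The function name is hypothetical, and the inputs (item score 0.5, error variance 0.1653, pseudo-chance level 0.2) are illustrative values only.

```python
def three_pl_item_measures(pi_2pl, err_2pl, c):
    """Map 2PLM item measures to their 3PLM counterparts.

    pi_2pl and err_2pl are the 2PLM item score and item error variance
    (Formulas 5 and 9); c is the pseudo-chance level.
    Returns (item score, item error variance, item true variance).
    """
    pi_star = c + (1.0 - c) * pi_2pl                                      # Formula 21
    err_star = c * (1.0 - c) * (1.0 - pi_2pl) + (1.0 - c) ** 2 * err_2pl  # Formula 22
    true_star = pi_star * (1.0 - pi_star) - err_star                      # Formula 23
    return pi_star, err_star, true_star


# Illustrative inputs with a pseudo-chance level of 0.2.
print([round(v, 6) for v in three_pl_item_measures(0.5, 0.1653, 0.2)])
```

One consequence of these relations is that the 3PLM item true variance equals $(1 - c_i)^2$ times the 2PLM item true variance, since adding the constant $c_i$ and rescaling by $(1 - c_i)$ in Equation 20 affects the variance of the item true score only through the scale factor.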

Item Reliability

As with the 2PLM, the reliability of individual binary items calibrated with the 3PLM is

$$\rho_{ii}^* = \frac{\sigma^2(\tau_i^*)}{\sigma^2(\tau_i^*) + \sigma^2(e_i^*)}, \tag{24}$$

where $\sigma^2(e_i^*)$ and $\sigma^2(\tau_i^*)$ are obtained with Formulas 22 and 23, respectively.



Figure 3. Error variance of binary items as a function of their discrimination ($a_i$) and
difficulty ($b_i$) parameters for a fixed pseudo-chance level ($c_i$ = 0.2) with the 3PLM.

True-Score Measures and Reliability at Test Level 

Formulas 12, 13, 14, 15, 17, and 18 for true-score measures and reliability at test level
with the 2PLM translate directly into their 3PLM counterparts for the marginal NR score, error
variance for the NR score, true score variance, reliability, and dependability (it suffices to use
star notations for the symbols that participate in the right-hand side of each of these six formulas).

Formulas for Binary Items Calibrated with the 1PLM

When the discrimination index in Equation 4 is a constant ($a_i = a$), the 2PLM translates
into the 1PLM. With the 1PLM, however, one should know which computational IRT model had
been used: logistic (with a scaling constant $D$ = 1.0) or logistic approximation of the normal ogive
model ($D$ = 1.7). Both options are provided with some computer programs for calibrations with
the 1PLM (e.g., RASCAL; Assessment Systems Corporation, 1995). When the standardization is
on the trait scores ($D$ = 1.7), one can use the formulas for true-score measures and reliability (at
item and test level) derived here for the 2PLM ($a_i$ = constant with the 1PLM). This approach does
not work, however, for a "pure" Rasch model ($D$ = 1; Rasch, 1960) in which the standardization
is on the item difficulty. For this case, formulas for true-score measures and reliability of binary
items are developed by Dimitrov (in press) for normal and logistic trait distributions.

Examples 

Simulated Data Example 

In this example, binary scores for 8,000 persons were simulated to fit the 2PLM with the
standard normal distribution for trait scores, $\theta \sim N(0,1)$, and fixed values of $a_i$ and $b_i$ for 20 items.
The purpose of this example is to illustrate the application of the formulas proposed in this article
for true-score measures and reliability of binary items calibrated with the 2PLM. The empirical
validation of Formulas 5 and 9 [for $\pi_i$ and $\sigma^2(e_i)$ with the 2PLM] is of particular interest because
these two formulas are based on approximations. All other formulas are obtained through exact
derivations and represent (explicitly or implicitly) functions of $\pi_i$ and $\sigma^2(e_i)$.

The data were generated using a computer program written in SAS (SAS Institute, 1985)
for Monte Carlo simulations of binary data that fit IRT models (Dimitrov, 1996). With the assumptions
of $\theta \sim N(0,1)$ and model fit with the 2PLM met in these simulations, the produced binary
scores (for 8,000 persons on 20 items) were analyzed using the computer program XCALIBRE
(Assessment Systems Corporation, 1995). Using the XCALIBRE estimates of $a_i$ and $b_i$ (given in
Table 1) allows us to test the "robustness" of Formulas 5 and 9 when they are used with
sample-based (i.e., less than "ideal") estimates of item parameters. The evaluations of true-score measures
and reliability in this example were facilitated through the use of the statistical program SPSS
(SPSS Inc., 2002). The SPSS syntax developed for this purpose (in Appendix B) works for binary
items calibrated with the 3PLM (input variables: $a_i$, $b_i$, and $c_i$), but it also can be used for items
calibrated with the 2PLM (with $c_i$ = 0) or the 1PLM (with $c_i$ = 0 and $a_i$ = constant). The SPSS run
generates the true-score measures and reliability for each item [$\pi_i$, $\sigma^2(e_i)$, $\sigma^2(\tau_i)$, and $\rho_{ii}$] as "new"
variables in the SPSS data spreadsheet. At test level, the SPSS printout provides values for the
marginal NR score ($\mu$), error variance for the NR score ($\sigma_e^2$), true variance for the NR score
($\sigma_\tau^2$), and variance of item scores [$\sigma^2(\pi_i)$].

The results from the SPSS run in this example (with $a_i$ and $b_i$ from Table 1 and $c_i$ = 0) are
provided in Appendix C. The true-score measures and reliability for individual items (upper panel
in Appendix C) are given in Table 1. At test level, the SPSS printout (lower panel in Appendix C)
provides the true score variance for the NR score ($\sigma_\tau^2$ = 6.315), the error variance for the NR
score ($\sigma_e^2$ = 3.719), the marginal NR score ($\mu$ = 8.956), and the variance of $\pi_i$ values for the 20
items [$\sigma^2(\pi_i)$ = .045]. With this, the domain score is $\pi = \mu/n$ = 8.956/20 = .448 and the reliability
is $\rho_{xx}$ = .63 (using Formula 15).
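
The reliability and domain score reported above can be reproduced by hand from the printed variance components. The arithmetic below simply substitutes the reported values into Formula 15 and into the definition of the domain score:

$$\rho_{xx} = \frac{6.315}{6.315 + 3.719} = \frac{6.315}{10.034} \approx .629, \qquad \pi = \frac{8.956}{20} \approx .448.$$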

The empirical estimates of true-score measures and reliability for the simulated data were
also determined and compared to their theoretical counterparts. Most importantly, a strong match
was found between the theoretical evaluations of $\pi_i$ and $\sigma^2(e_i)$ and their empirical counterparts
denoted here as $p_i$ and $s_i^2$, respectively. The empirical item scores, $p_i$ (provided by XCALIBRE
for the simulated data), are given in Table 1. The difference between $p_i$ and $\pi_i$ (also in Table 1) is
smaller than 0.01 in absolute value. The same is true for the difference between theoretical and
empirical item variances: $\sigma^2(e_i) - s_i^2$. One can check this quickly and easily using, for example, the
SPSS spreadsheet for Table 1 and calculating $s_i^2 = p_i(1 - p_i)$.

As noted earlier, the empirical validation of the accuracy of Formula 5 and Formula 9 is
important because the values of $\pi_i$ and $\sigma^2(e_i)$ produced by these two formulas govern the values
of other true-score measures and reliability indexes. Given the strong match between theoretical
and empirical estimates for the item score and the item error variance in this example, it is not a
surprise then that Cronbach's alpha for the sample of simulated binary scores ($N$ = 8,000) was
found to be the same as the theoretically evaluated reliability ($\alpha = \rho_{xx}$ = .63). Similarly, the mean
and the variance of the empirical item scores in Table 1 [$\bar{p}$ = .448 and $s^2(p_i)$ = 0.044] also match
their theoretical counterparts [$\pi$ = 0.448 and $\sigma^2(\pi_i)$ = 0.045]. Thus, with the assumptions of data
fit and normal trait distribution met, there is a strong match between the theoretical and empirical
values of true-score measures even when the proposed formulas are applied with IRT estimates
(not "ideal" values) of the item parameters for relatively large samples (in this case, $N$ = 8,000).

Real Data Example 

The data for this example consisted of dichotomously scored responses of 4,854 fifth
graders on 24 multiple-choice items of the Ohio Off-Grade Proficiency Test-Reading (Riverside
Publishing, 1997) in a large urban area in northeastern Ohio. The items capture four strands of
learning outcomes defined by the publisher as (a) examining meaning given a fiction or poetry
text, (b) extending meaning given a fiction or poetry text, (c) examining meaning given a
nonfiction text, and (d) extending meaning given a nonfiction text. The data were analyzed using
XCALIBRE with the 3PLM (to accommodate "guessing" with the multiple-choice items). For
the test of data fit, XCALIBRE reports a standardized residual statistic for each item. This statistic
is normally distributed and values in excess of 2.0 indicate misfit with a type I error rate of 0.05.
In this example, the standardized residuals for the 24 binary items ranged from 0.34 to 1.13, thus
indicating that the data fit the 3PLM. The XCALIBRE estimates of item discrimination, $a_i$, item
difficulty, $b_i$, and pseudo-chance level, $c_i$, are provided in Table 2 (the 24 items are grouped by
strands of learning outcomes).

The normal quantile tests (proportion-proportion and quantile-quantile comparisons of the
observed and expected values) were conducted using SPSS with the trait scores, $\theta$, provided by
XCALIBRE for the sample data ($N$ = 4,854). The results indicated a good fit of $\theta$ to $N(0,1)$, thus
allowing the application of formulas developed in this article (see also Figure 4). The theoretical
true-score measures and reliability (at item and test level) were evaluated through the use of the
SPSS syntax in Appendix B (with the item parameters $a_i$, $b_i$, and $c_i$ in Table 2 as "input" SPSS
variables). The results are summarized in Table 2 by strands of learning outcomes. In terms of
domain score, the highest performance of the target population of fifth graders is on the learning
outcome "poetry - constructing meaning" ($\pi$ = .664), whereas their lowest performance is on the
learning outcome "nonfiction - extending meaning" ($\pi$ = .475). The dependability index $\Phi(\lambda)$ was
also calculated (using Formula 17) for values of the cutting score $\lambda$ on the proportion of items
correct scale from 0 to 1 with a "step" of 0.01 (see Figure 5). As one can see, the dependability of
pass/fail decisions based on a domain cutting score $\lambda$ = .8 (i.e., 80% items correct) is $\Phi(\lambda)$ = .90.

With the data in this example (as with any sample of real data), it is not realistic to expect
ideal conditions for the assumptions of model fit and normality of the trait distribution. Yet, there
is still a good match between theoretical and empirical values for item scores ($\pi_i$ versus $p_i$ values
in Table 2), variance of item scores [$\sigma^2(\pi_i)$ = .027 versus $s^2(p_i)$ = .025], domain score ($\pi$ = .585
versus $\bar{p}$ = .586), and reliability ($\rho_{xx}$ = .789 versus Cronbach's $\alpha$ = .801). Additional comments on
$\rho_{xx}$ and its empirical evaluation through Cronbach's $\alpha$ are provided in the discussion part.

In this example the 3PLM estimates of item parameters were determined from sample
data, but the procedures described in the previous paragraph remain the same when $a_i$, $b_i$, and $c_i$
are known from previous (or simulated) calibrations with the 3PLM. One can use the procedures
(without further data collection) to determine the true-score characteristics and reliability for any
combination of calibrated items - for example, to develop test booklets with the OOPT-Reading
test for follow-up reading diagnostics (e.g., in different school districts).




Figure 4. Frequency distribution (with a normal curve fit) of the trait scores for the sample of real
data on the OOPT-Reading ($N$ = 4,854); horizontal axis: theta (logits).

Figure 5. Dependability index, $\Phi(\lambda)$, as a function of the cutting score, $\lambda$, for the OOPT-Reading.



Conclusion 

This article provides analytic evaluations (formulas) for marginal true-score measures and 
reliability of binary items as a function of their IRT parameters. Assuming the normal distribution 
of trait scores, the formulas can be applied for items calibrated with the IPLM, 2PLM, or 3PLM 
without information about binary scores or trait scores of persons from the target population. At 
item level, the proposed formulas provide evaluation for the following marginal measures: item 
score (vCj), item error variance [a^(e;)], item true variance [o^(t;)], and item reliability At test 
level, the item true-score measures are "summarized" in formulas for the marginal NR score (p), 
domain score (vc), error variance for the NR score (of), true variance for the NR score (of), 
reliability (p^^), and dependability [0(X,)] for criterion-referenced interpretations (e.g., "pass/fail") 
based on a domain cutting score, %. 
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Brief clarifications about the derivation design for the formulas proposed in this article are 
necessary. For item calibrations with the 2PLM, the formulas for item score (itj, Formula 5) and 
item error variance [<3\e, ), Formula 9] are based on approximations, but the absolute error with 
these approximations is practically close to zero (less than 0.0005, with Formula 5, and less than 
0.005 with Formula 9). All other formulas are obtained through exact derivations that (explicitly 
or implicitly) involve ti; and o^(Cj) - Formulas 10, 11, 12, 13, 14, 15, 17, 18, 21, 22, and 23. Some 
arguments in support of using the formulas proposed in this article versus brute-force numerical 
integrations also seem appropriate. First, the proposed formulas are easy to perform with widely 
used spreadsheets, statistical programs (e.g., SPSS, see Appendix B), or even regular calculators. 
Numerical integrations, instead, require computer programming with more complicated analytic 
expressions (e.g., Gaussian quadratures) thus limiting the range of potential users with studies that 
involve evaluations at true-score level. Moreover, some methods of numerical integrations involve 
procedures that may negatively affect the accuracy. For example, the Simpson’s rule for numerical 
integrations with Equation 4 involves an approximation of the compound binomial distribution of 
raw scores (e.g.. May & Nicewander, 1993) which, in turn, leads to losing accuracy in estimating 
the true score variance. In contrast. Formula 10 (for item true variance) does not use preliminary 
approximations. As a reminder. Formula 5 (for ti;) and Formula 9 [for o^(ej)] use approximations 
(with an error practically close to zero), whereas all other formulas in this article are based on 
exact derivations. Along with technical advantages, the formulas provide theoretical relationships 
that may remain hidden with numerical integrations. Formula 9, for example, shows that the item 
error variance is an even function of the item difficulty, b-^, for fixed values of the discrimination 
index, . Also, while Formulas 10 and 23 reveal relationships between true-score measures for 
item calibrations with the same (e.g., 2PL or 3PL) ERT model. Formulas 21 and 22 connect item 
true-score measures with the 3PLM to item true-score measures with the 2PLM. The proposed 
formulas allow researchers to plan (model, predict) true-score measures, whereas the numerical 
integrations put researchers in a "post-hoc" position. The proposed formulas, therefore, provide 
more than just calculations: they capture theoretical relationships between concepts of IRT and
true-score theory that may have useful applications in research and instructional settings (e.g., 
graduate courses in measurement). 
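
The contrast between the closed-form evaluation and brute-force integration can be illustrated with a minimal Python sketch. It evaluates Formula 5 for item 1 of Table 1 and compares the result with a trapezoidal integration of the item response function over a standard normal trait distribution. The logistic form with scaling constant D = 1.7, the integration range, the step count, and all function names are assumptions made for this illustration; they are not taken from the article.

```python
import math

D = 1.7  # assumed logistic scaling constant (an assumption of this sketch)


def p_2pl(theta, a, b):
    """Assumed 2PL response function with scaling constant D."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def normal_pdf(theta):
    """Standard normal density, used as the trait distribution."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)


def pi_closed_form(a, b):
    """Formula 5: pi_i = [1 - erf(X_i)] / 2 with X_i = a*b / sqrt(2*(1 + a^2))."""
    x = a * b / math.sqrt(2.0 * (1.0 + a * a))
    return (1.0 - math.erf(x)) / 2.0


def pi_numeric(a, b, lo=-8.0, hi=8.0, steps=4000):
    """Brute-force trapezoidal integration of P_i(theta) * phi(theta)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * p_2pl(theta, a, b) * normal_pdf(theta)
    return total * h


a, b = 0.449, -2.554  # item 1 in Table 1
closed = pi_closed_form(a, b)
numeric = pi_numeric(a, b)
print(f"closed form: {closed:.4f}  numerical integration: {numeric:.4f}")
```

The closed form needs one line of arithmetic per item, whereas the numerical route requires a response function, a density, a grid, and a choice of integration limits, which is the practical point made above.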

The comparison of theoretical true-score measures and reliability with their empirical 
counterparts for real data also deserves attention. The empirical approach (a) requires information 
about individual binary scores for persons from the target population and (b) provides sample- 
based estimates which may (to a large extent) misrepresent the population parameters for true- 
score measures and reliability. Conversely, the proposed formulas provide accurate evaluation of 
true-score measures and reliability at population level without using sample scores (IRT estimates 
of the item parameters suffice). It should also be emphasized that Cronbach’s alpha is an accurate
empirical estimate of reliability ($\rho_{xx}$) only if there is no correlation among errors and the test
components are essentially tau-equivalent (Novick & Lewis, 1967). The evaluation of $\rho_{xx}$ in this
article, however, does not require tau-equivalency (the weaker assumption of congeneric items
suffices). As a reminder, test items are (a) congeneric if they measure the same trait and
(b) tau-equivalent if they measure the same trait and their true scores have equal variances (e.g.,
Jöreskog, 1971). It should also be noted that when the tau-equivalency assumption does not hold,
Cronbach’s alpha underestimates $\rho_{xx}$. However, Cronbach’s alpha may overestimate $\rho_{xx}$ when
there is a correlation among errors (e.g., Komaroff, 1997; Raykov, 2001). Correlated errors may
occur, for example, with items that relate to a common stimulus (e.g., same paragraph or graph). 
For example, the fact that (with the real data example in the previous section) Cronbach’s alpha 
(.801) slightly overestimated the theoretical evaluation of $\rho_{xx}$ (.789) should not be a surprise, as
some items in the reading test (OOPT-Reading) relate to the same paragraph (i.e., correlated
errors may occur). From another perspective, while the marginal reliability for IRT trait scores in
computerized adaptive testing is evaluated for the population (Green et al., 1984), it is compared
to Cronbach’s alpha for alternatively used paper-and-pencil forms. Clearly, it is more appropriate
to compare the theoretical marginal reliability in an IRT system to theoretical evaluations of $\rho_{xx}$
(e.g., with the formulas provided in this article).

As illustrated with the examples in the previous section, given the IRT calibration of 
binary items, one can evaluate their true-score measures and reliability for norm-referenced and 
criterion-referenced interpretations. One can also do this for any combination of items grouped by 
measurement or substantive characteristics (e.g., by content or learning outcomes) without using 
(trait or raw-score) data. This can be particularly useful in developing test booklets for follow-up
measurements in longitudinal studies using the IRT calibration of items for a base year study. It 
should be noted that in previous studies (e.g., National Center for Educational Statistics, 1995)
test booklets that are developed for follow-up measurements are usually compared on average 
item difficulty thus ignoring the effect of the other item parameter(s). With the formulas proposed 
in this article, true-score measures and reliability are evaluated as functions of all item parameters 
(with an appropriate IRT model) prior to follow-up data collection. The formulas can also be
incorporated into computer programs for simulation studies thus allowing researchers to generate 
targeted true-score measures from (hypothetical or real) IRT parameters of binary items. 
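
As a minimal sketch of how a group of items can be evaluated from item-level measures alone, the Python fragment below recomputes the strand-level values for the four "Poetry - extending meaning" items in Table 2 (items 3, 4, 9, and 24). The function and variable names are hypothetical, and the inputs are the rounded item values printed in Table 2, so small rounding differences from the reported strand values (.596, 0.722, 0.517, .417) are to be expected. The reliability ratio used here, true score variance over the sum of true score and error variance, is the one consistent with the strand rows of Table 2.

```python
import math

# (item score pi_i, error variance, true variance) from Table 2, rounded values
strand_items = {
    3: (0.463, 0.187, 0.062),
    4: (0.855, 0.110, 0.014),
    9: (0.605, 0.210, 0.029),
    24: (0.460, 0.215, 0.033),
}


def strand_summary(items):
    """Aggregate item-level measures into strand-level true-score measures."""
    values = list(items.values())
    n = len(values)
    domain_score = sum(pi for pi, _, _ in values) / n      # mean item score
    error_var = sum(ve for _, ve, _ in values)             # errors add up
    # True variance of the number-right score: squared sum of item true SDs
    true_var = sum(math.sqrt(vt) for _, _, vt in values) ** 2
    reliability = true_var / (true_var + error_var)
    return domain_score, error_var, true_var, reliability


domain_score, error_var, true_var, reliability = strand_summary(strand_items)
print(f"domain score {domain_score:.3f}, error variance {error_var:.3f}, "
      f"true variance {true_var:.3f}, reliability {reliability:.3f}")
```

Any other subset of calibrated items (for example, a candidate follow-up booklet) can be passed to the same function without collecting new response data.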

It is important to emphasize that the formulas proposed in this article deal with marginal 
true-score measures and reliability and, therefore, do not provide conditional information about 
scores and their accuracy at separate trait levels. However, while "diagnostic" IRT information 
about trait measures for individuals is valuable, marginal true-score information about the 
population and the measurement quality of the test is also useful. In a sports analogy, while the 
assessment of individual players is very important, the evaluation of the team as a whole is also 
important. In conclusion, researchers and practitioners can greatly benefit from combining IRT 
conditional information about trait and true-score measures (e.g., using a test characteristic curve) 
with marginal true-score information provided by the proposed formulas. 
Appendix A

Proof of Formula 5. 

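Formula 5 provides an approximation (with an absolute error smaller than 0.0005) for the
marginal scores of binary items

$$\pi_i = \frac{1 - \mathrm{erf}(X_i)}{2}, \tag{A1}$$

where $X_i = a_i b_i / \sqrt{2(1 + a_i^2)}$ and $\mathrm{erf}(X_i)$ is the error function (e.g., Hastings, 1955, p. 185)

$$\mathrm{erf}(X) = \frac{2}{\sqrt{\pi}} \int_0^{X} \exp(-u^2)\,du. \tag{A2}$$

Lord’s approximation (Lord & Novick, 1968, p. 377, Equation 16.9.3) for the item
score (marginal probability for correct response on the item) is

$$\pi_i = \frac{1}{\sqrt{2\pi}} \int_{\gamma_i}^{\infty} \exp(-t^2/2)\,dt, \tag{A3}$$

where $\gamma_i = a_i b_i / \sqrt{1 + a_i^2}$. With the substitution $t = u\sqrt{2}$ (and $\gamma_i = X_i\sqrt{2}$, respectively) we
have

$$\pi_i = \frac{1}{\sqrt{\pi}} \int_{X_i}^{\infty} \exp(-u^2)\,du = \frac{1}{2} - \frac{1}{\sqrt{\pi}} \int_0^{X_i} \exp(-u^2)\,du = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}(X_i) = \frac{1 - \mathrm{erf}(X_i)}{2},$$

with which the proof is completed.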

It should be noted that Formula 5 (or A1) provides an exact theoretical relationship, but it
is referred to here as an approximation formula for $\pi_i$ because the error function, $\mathrm{erf}(X_i)$, in this
formula is evaluated through approximations. With Hastings’ approach (see Equation 6), the
approximation error for $\mathrm{erf}(X_i)$ is smaller than 0.0005 in absolute value. If, however, $\mathrm{erf}(X_i)$ is
evaluated with mathematical software (e.g., MATLAB; MathWorks,
Inc.), the absolute approximation error is even smaller than the (practically zero) error of 0.0005.
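
The size of the approximation error can be checked numerically. The sketch below, in Python, codes the rational approximation of the error function with the coefficients that appear in the Appendix B syntax and compares it with the library function `math.erf` on an evenly spaced grid. The grid range of -4 to 4, the step of 0.001, and the function name `erf_hastings` are choices made for this illustration only.

```python
import math


def erf_hastings(x):
    """Rational approximation of erf(x), coefficients as in the Appendix B syntax."""
    t = abs(x)
    denom = (1.0 + 0.278393 * t + 0.230389 * t ** 2
             + 0.000972 * t ** 3 + 0.078108 * t ** 4) ** 4
    value = 1.0 - 1.0 / denom
    # erf is an odd function, so the sign of x is restored at the end
    return -value if x < 0 else value


grid = [k / 1000.0 for k in range(-4000, 4001)]
max_error = max(abs(erf_hastings(x) - math.erf(x)) for x in grid)
print(f"largest absolute error on the grid: {max_error:.6f}")
```

Because the item score in Formula 5 divides the error function by 2, the corresponding error in $\pi_i$ is half of the error in $\mathrm{erf}(X_i)$.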
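Proof of Formula 10

Formula 10 represents the item true variance, $\sigma^2(\tau_i)$, as an exact function of the item score, $\pi_i$,
and item error variance, $\sigma^2(e_i)$. Using the variance expectation rule $\mathrm{VAR}(X) = E(X^2) - [E(X)]^2$
with $X = P_i(\theta)$, we have

$$\begin{aligned}
\sigma^2(\tau_i) &= \int_{-\infty}^{\infty} P_i^2(\theta)\,\varphi(\theta)\,d\theta - \pi_i^2 \\
&= \int_{-\infty}^{\infty} \{P_i(\theta) - P_i(\theta)[1 - P_i(\theta)]\}\,\varphi(\theta)\,d\theta - \pi_i^2 \\
&= \int_{-\infty}^{\infty} P_i(\theta)\,\varphi(\theta)\,d\theta - \int_{-\infty}^{\infty} P_i(\theta)[1 - P_i(\theta)]\,\varphi(\theta)\,d\theta - \pi_i^2 \\
&= \pi_i - \sigma^2(e_i) - \pi_i^2 = \pi_i(1 - \pi_i) - \sigma^2(e_i),
\end{aligned}$$

with which the proof is completed.

Proof of Formula 14

Formula 14 represents the true score variance for the NR score, $\sigma^2(\tau)$, as an exact function
of the item score, $\pi_i$, and item error variance, $\sigma^2(e_i)$. For unidimensional tests (which are dealt with
in this article), there is a perfect correlation between the congeneric true scores ($\tau_i$ and $\tau_j$) of any
two items, $i$ and $j$, because of the linear relationship $\tau_i = a_{ij} + b_{ij}\tau_j$, where $b_{ij} \neq 0, 1$ (e.g.,
Jöreskog, 1971). Thus, the covariance of $\tau_i$ and $\tau_j$ is $\sigma(\tau_i, \tau_j) = \sigma(\tau_i)\sigma(\tau_j)$. With this, the variance of the true
number-right score on a test of $n$ binary items, $\tau = \sum \tau_i$ ($i = 1, \ldots, n$), can be represented as

$$\sigma^2(\tau) = \sum_{i=1}^{n}\sum_{j=1}^{n} \sigma(\tau_i, \tau_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} \sigma(\tau_i)\,\sigma(\tau_j). \tag{A4}$$

Replacing $\sigma(\tau_i)$ and $\sigma(\tau_j)$ in the far right side of Equation A4 with their equivalent expressions in
Formula 10, we obtain Formula 14. With this the proof is completed.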
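
Both results can be checked numerically. The Python sketch below (a) verifies the identity behind Formula 10 for one hypothetical item by computing the variance of $P_i(\theta)$ directly and through $\pi_i(1 - \pi_i) - \sigma^2(e_i)$ on the same integration grid, and (b) confirms that the double sum in Equation A4 equals the squared sum of the item true-score standard deviations. The logistic response function with scaling constant 1.7, the item parameters, and the three true variances are assumptions chosen only for illustration.

```python
import math


def p_2pl(theta, a, b, d=1.7):
    """Assumed 2PL logistic response function (scaling constant d is an assumption)."""
    return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))


def marginal_moments(a, b, lo=-8.0, hi=8.0, steps=2000):
    """Trapezoidal estimates of E[P], E[P(1 - P)], and E[P^2] under N(0, 1)."""
    h = (hi - lo) / steps
    e_p = e_pq = e_p2 = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        w = h * (0.5 if k in (0, steps) else 1.0)
        w *= math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        p = p_2pl(theta, a, b)
        e_p += w * p
        e_pq += w * p * (1.0 - p)
        e_p2 += w * p * p
    return e_p, e_pq, e_p2


# (a) Formula 10 for one hypothetical item (a = 0.8, b = 0.5)
pi_i, err_var, second_moment = marginal_moments(0.8, 0.5)
true_var_direct = second_moment - pi_i ** 2          # VAR(X) = E(X^2) - [E(X)]^2
true_var_formula10 = pi_i * (1.0 - pi_i) - err_var   # pi_i(1 - pi_i) - sigma^2(e_i)

# (b) Equation A4 for three hypothetical item true variances
item_true_vars = [0.04, 0.09, 0.01]
double_sum = sum(math.sqrt(vi * vj)
                 for vi in item_true_vars for vj in item_true_vars)
squared_sum = sum(math.sqrt(v) for v in item_true_vars) ** 2

print(true_var_direct, true_var_formula10)
print(double_sum, squared_sum)
```

The squared-sum form is convenient in practice because it needs only n square roots instead of n² products.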
Proof of Formula 22

Formula 22 represents the error variance for individual binary items calibrated with the
3PLM, $\sigma^2(e_i^*)$, as an exact function of the 2PLM evaluations for item score, $\pi_i$, and item error
variance, $\sigma^2(e_i)$. Given the relationship between $P_i^*(\theta)$ with the 3PLM and $P_i(\theta)$ with the 2PLM
(see Equation 20), it can be easily seen that

$$P_i^*(\theta)[1 - P_i^*(\theta)] = c_i(1 - c_i)[1 - P_i(\theta)] + (1 - c_i)^2 P_i(\theta)[1 - P_i(\theta)]. \tag{A5}$$

Using Equation A5, the proof of Formula 22 is provided with the following integral manipulations:

$$\begin{aligned}
\sigma^2(e_i^*) &= \int_{-\infty}^{\infty} P_i^*(\theta)[1 - P_i^*(\theta)]\,\varphi(\theta)\,d\theta \\
&= c_i(1 - c_i)\int_{-\infty}^{\infty} \varphi(\theta)\,d\theta - c_i(1 - c_i)\int_{-\infty}^{\infty} P_i(\theta)\,\varphi(\theta)\,d\theta \\
&\quad + (1 - c_i)^2 \int_{-\infty}^{\infty} P_i(\theta)[1 - P_i(\theta)]\,\varphi(\theta)\,d\theta \\
&= c_i(1 - c_i) - c_i(1 - c_i)\,\pi_i + (1 - c_i)^2\,\sigma^2(e_i) \\
&= c_i(1 - c_i)(1 - \pi_i) + (1 - c_i)^2\,\sigma^2(e_i).
\end{aligned}$$
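
A small worked example may help to see the conversion at work. The Python sketch below applies the 3PLM conversion to hypothetical 2PLM values ($\pi_i$ = .5, $\sigma^2(e_i)$ = .2, $c_i$ = .2) and checks the pointwise identity in Equation A5 on a grid of probabilities. The relationship $P_i^* = c_i + (1 - c_i)P_i$ is taken from the conversion line in the Appendix B syntax; the function names and input values are illustrative assumptions.

```python
def to_3plm(pi_2pl, err_var_2pl, c):
    """Convert 2PLM item score and error variance to their 3PLM counterparts."""
    pi_3pl = c + (1.0 - c) * pi_2pl
    # Formula 22: c(1 - c)(1 - pi) + (1 - c)^2 * sigma^2(e)
    err_var_3pl = c * (1.0 - c) * (1.0 - pi_2pl) + (1.0 - c) ** 2 * err_var_2pl
    # Formula 10 applied to the 3PLM values, truncated at zero as in Appendix B
    true_var_3pl = max(pi_3pl * (1.0 - pi_3pl) - err_var_3pl, 0.0)
    return pi_3pl, err_var_3pl, true_var_3pl


def a5_gap(p, c):
    """Absolute difference between the two sides of Equation A5 at probability p."""
    p_star = c + (1.0 - c) * p
    lhs = p_star * (1.0 - p_star)
    rhs = c * (1.0 - c) * (1.0 - p) + (1.0 - c) ** 2 * p * (1.0 - p)
    return abs(lhs - rhs)


pi_star, err_star, true_star = to_3plm(pi_2pl=0.5, err_var_2pl=0.2, c=0.2)
max_gap = max(a5_gap(k / 100.0, 0.2) for k in range(101))
print(pi_star, err_star, true_star, max_gap)
```

With these inputs the 2PLM true variance is .25 - .20 = .05, and the 3PLM true variance equals $(1 - c_i)^2$ times that value, which follows because $P_i^*(\theta)$ is a linear function of $P_i(\theta)$.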
Appendix B

SPSS Syntax: Evaluation of Marginal True-Score Measures for Binary Items
Input variables: IRT item parameters ($a_i$, $b_i$, and $c_i$), stored as a, b, and c

* Item error variance with the 2PLM (Formula 9).
COMPUTE p = .2646 - .118*a + .0187*(a**2).
COMPUTE s = .7427 + .7081/a + .0074/(a**2).
COMPUTE ve = p*exp(-.5*((b/s)**2)).
* Item score with the 2PLM (Formula 5), using the approximation of the error function.
COMPUTE X = (a*b)/sqrt(2*(1 + a**2)).
COMPUTE erf = (1 + .278393*abs(X) + .230389*X**2 + .000972*(abs(X))**3 + .078108*X**4)**4.
COMPUTE erf = 1 - 1/erf.
IF (X < 0) erf = -erf.
COMPUTE pi = (1 - erf)/2.
* Item true variance with the 2PLM (Formula 10), truncated at zero.
COMPUTE vt = pi*(1 - pi) - ve.
IF (vt < 0) vt = 0.
* Conversion to the 3PLM (no change when c = 0).
COMPUTE ve = c*(1 - c)*(1 - pi) + ve*((1 - c)**2).
COMPUTE pi = c + (1 - c)*pi.
COMPUTE vt = pi*(1 - pi) - ve.
IF (vt < 0) vt = 0.
SET FORMAT = F8.3 ERRORS = NONE RESULTS OFF HEADER NO.
* Transpose the file so that items become variables, then accumulate the double sum (Formula 14).
FLIP
 VARIABLES = a b c pi ve vt.
VECTOR V = VAR001 TO VAR020.
COMPUTE Y = 0.
LOOP #I = 1 TO 20.
LOOP #J = 1 TO 20.
COMPUTE Y = Y + SQRT(V(#I)*V(#J)).
END LOOP.
END LOOP.
FLIP VAR001 TO VAR020 Y.
COMPUTE roi = vt/(vt + ve).
SET RESULTS ON.
REPORT FORMAT = AUTOMATIC
 /VARIABLES = pi ' ' ve ' ' vt ' '
 /BREAK = (TOTAL)
 /SUMMARY = MAX(vt) 'True score variance:'
 /SUMMARY = SUBTRACT(SUM(ve) MAX(ve)) (vt (COMMA) (3)) 'Error variance:'
 /SUMMARY = SUBTRACT(SUM(pi) MAX(pi)) (vt (COMMA) (3)) 'Marginal NR score:' .
SELECT IF (CASE_LBL ~= 'Y').
RENAME VARIABLES (CASE_LBL = ITEM) (ve = var_err) (vt = var_tau).
VARIABLE LABELS pi 'item score'.
DESCRIPTIVES
 VARIABLES = pi
 /STATISTICS = VAR .

Note. The user should specify the number of items (in this example, 20) in the syntax. With 50 items, for
example, change 20 to 50 and VAR020 to VAR050 in the four syntax lines where they appear (the VECTOR line, the two LOOP lines, and the second FLIP line).
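
For readers without access to SPSS, the same computations can be sketched in Python. The fragment below is an illustrative port of the syntax above, not part of the original appendix: it uses the same constants and the same approximation of the error function, takes the 20 pairs of $a_i$ and $b_i$ listed in Appendix C with $c_i$ = 0, and replaces the FLIP and LOOP steps with a direct squared sum, which is algebraically identical to the double sum. All function and variable names are hypothetical.

```python
import math


def erf_approx(x):
    """Approximation of erf(x) with the coefficients used in the SPSS syntax."""
    t = abs(x)
    denom = (1 + 0.278393 * t + 0.230389 * t ** 2
             + 0.000972 * t ** 3 + 0.078108 * t ** 4) ** 4
    value = 1 - 1 / denom
    return -value if x < 0 else value


def item_measures(a, b, c=0.0):
    """Return (item score, error variance, true variance) for one binary item."""
    p = 0.2646 - 0.118 * a + 0.0187 * a ** 2        # 'p' in the syntax
    s = 0.7427 + 0.7081 / a + 0.0074 / a ** 2       # 's' in the syntax
    ve = p * math.exp(-0.5 * (b / s) ** 2)          # 2PLM error variance
    x = a * b / math.sqrt(2 * (1 + a ** 2))
    pi = (1 - erf_approx(x)) / 2                    # 2PLM item score
    ve = c * (1 - c) * (1 - pi) + ve * (1 - c) ** 2  # 3PLM conversion
    pi = c + (1 - c) * pi
    vt = max(pi * (1 - pi) - ve, 0.0)               # true variance, floored at 0
    return pi, ve, vt


# (a_i, b_i) for the 20 simulated items, as listed in Appendix C
ITEMS = [
    (0.449, -2.554), (0.402, -2.161), (0.232, -1.551), (0.240, -1.226),
    (0.610, -0.127), (0.551, -0.855), (0.371, -0.568), (0.321, -0.277),
    (0.403, -0.017), (0.434, 0.294), (0.459, 0.532), (0.410, 0.773),
    (0.302, 1.004), (0.343, 1.250), (0.225, 1.562), (0.215, 1.385),
    (0.487, 2.312), (0.608, 2.650), (0.341, 2.712), (0.465, 3.000),
]

results = [item_measures(a, b) for a, b in ITEMS]
nr_score = sum(pi for pi, _, _ in results)
error_variance = sum(ve for _, ve, _ in results)
true_variance = sum(math.sqrt(vt) for _, _, vt in results) ** 2
reliability = true_variance / (true_variance + error_variance)

print(f"True score variance: {true_variance:.3f}")
print(f"Error variance:      {error_variance:.3f}")
print(f"Marginal NR score:   {nr_score:.3f}")
```

The three printed summaries correspond to the report lines produced by the SPSS run in Appendix C; small differences in the last decimal place may arise from rounding of the input parameters.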
Appendix C

Results from the SPSS syntax run (Input variables: $a_i$, $b_i$ from Table 1 and $c_i$ = 0)

| Item | a | b | c | pi | var_err | var_tau | roi |
|---|---|---|---|---|---|---|---|
| 1 | .449 | -2.554 | .000 | .852 | .120 | .006 | .050 |
| 2 | .402 | -2.161 | .000 | .790 | .154 | .012 | .074 |
| 3 | .232 | -1.551 | .000 | .637 | .220 | .011 | .047 |
| 4 | .240 | -1.226 | .000 | .612 | .226 | .012 | .050 |
| 5 | .610 | -.127 | .000 | .526 | .199 | .050 | .201 |
| 6 | .551 | -.855 | .000 | .660 | .188 | .036 | .161 |
| 7 | .371 | -.568 | .000 | .578 | .219 | .025 | .104 |
| 8 | .321 | -.277 | .000 | .534 | .228 | .021 | .085 |
| 9 | .403 | -.017 | .000 | .502 | .220 | .030 | .120 |
| 10 | .434 | .294 | .000 | .454 | .215 | .033 | .131 |
| 11 | .459 | .532 | .000 | .412 | .209 | .034 | .138 |
| 12 | .410 | .773 | .000 | .385 | .209 | .027 | .116 |
| 13 | .302 | 1.004 | .000 | .386 | .219 | .018 | .074 |
| 14 | .343 | 1.250 | .000 | .342 | .206 | .019 | .086 |
| 15 | .225 | 1.562 | .000 | .366 | .222 | .010 | .044 |
| 16 | .215 | 1.385 | .000 | .385 | .227 | .010 | .040 |
| 17 | .487 | 2.312 | .000 | .156 | .123 | .008 | .062 |
| 18 | .608 | 2.650 | .000 | .084 | .078 | .000 | .000 |
| 19 | .341 | 2.712 | .000 | .191 | .146 | .009 | .058 |
| 20 | .465 | 3.000 | .000 | .103 | .091 | .001 | .013 |

Note. pi = $\pi_i$; var_err = $\sigma^2(e_i)$; var_tau = $\sigma^2(\tau_i)$; roi = $\rho_{ii}$.

Report:

True score variance: 6.315
Error variance: 3.719
Marginal NR score: 8.956

Descriptive Statistics

| | N | Variance |
|---|---|---|
| item score | 20 | .045 |

Dimitrov 



Expected Values and Reliability 






Table 1

True-Score Measures and Reliability for Simulated Binary Items Calibrated
with the 2PLM

| Item | $a_i$ | $b_i$ | $\pi_i$ ($p_i$)$^a$ | $\sigma^2(e_i)$ | $\sigma^2(\tau_i)$ | $\rho_{ii}$ | $p_i - \pi_i$ |
|---|---|---|---|---|---|---|---|
| 1 | .449 | -2.554 | .852 (.849) | .120 | .006 | .050 | -.003 |
| 2 | .402 | -2.161 | .790 (.785) | .154 | .012 | .074 | -.005 |
| 3 | .232 | -1.551 | .637 (.644) | .220 | .011 | .047 | .007 |
| 4 | .240 | -1.226 | .612 (.618) | .226 | .012 | .050 | .006 |
| 5 | .610 | -.127 | .526 (.526) | .199 | .050 | .201 | -.001 |
| 6 | .551 | -.855 | .660 (.653) | .188 | .036 | .161 | -.007 |
| 7 | .371 | -.568 | .578 (.577) | .219 | .025 | .104 | -.001 |
| 8 | .321 | -.277 | .534 (.534) | .228 | .021 | .085 | .000 |
| 9 | .403 | -.017 | .502 (.503) | .220 | .030 | .120 | .001 |
| 10 | .434 | .294 | .454 (.456) | .215 | .033 | .131 | .002 |
| 11 | .459 | .532 | .412 (.416) | .209 | .034 | .138 | .004 |
| 12 | .410 | .773 | .385 (.389) | .209 | .027 | .116 | .004 |
| 13 | .302 | 1.004 | .386 (.384) | .219 | .018 | .074 | -.002 |
| 14 | .343 | 1.250 | .342 (.345) | .206 | .019 | .086 | .003 |
| 15 | .225 | 1.562 | .366 (.360) | .222 | .010 | .044 | -.006 |
| 16 | .215 | 1.385 | .385 (.379) | .227 | .010 | .040 | -.006 |
| 17 | .487 | 2.312 | .156 (.163) | .123 | .008 | .062 | .007 |
| 18 | .608 | 2.650 | .084 (.092) | .078 | .000 | .000 | .008 |
| 19 | .341 | 2.712 | .191 (.192) | .146 | .009 | .058 | .001 |
| 20 | .465 | 3.000 | .103 (.099) | .091 | .001 | .013 | -.004 |

$^a$ Observed item score (proportion correct responses) for the simulated data (N = 8,000).

Table 2

True-Score Measures and Reliability by Strands of Learning Outcomes with the OOPT-Reading

| Strand / Item | $a_i$ | $b_i$ | $c_i$ | $\pi_i$ ($p_i$)$^a$ | $\sigma^2(e_i)$ | $\sigma^2(\tau_i)$ | $\rho_{ii}$ | $\pi$ | $\sigma^2(e)$ | $\sigma^2(\tau)$ | $\rho_{xx}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Poetry - constructing meaning (n = 10) | | | | | | | | .664 | 1.772 | 3.293 | .650 |
| 1 | 1.089 | -.732 | .209 | .767 (.759) | .135 | .044 | .244 | | | | |
| 2 | .948 | -.418 | .220 | .698 (.696) | .166 | .045 | .214 | | | | |
| 5 | .494 | .900 | .226 | .493 (.495) | .231 | .019 | .076 | | | | |
| 6 | .494 | .885 | .234 | .500 (.503) | .231 | .019 | .075 | | | | |
| 7 | .905 | -.672 | .185 | .734 (.727) | .154 | .041 | .212 | | | | |
| 8 | 1.165 | -1.144 | .205 | .847 (.838) | .099 | .031 | .238 | | | | |
| 20 | .594 | -.412 | .209 | .670 (.670) | .192 | .029 | .131 | | | | |
| 21 | .716 | .475 | .237 | .536 (.542) | .217 | .032 | .129 | | | | |
| 22 | .703 | -.492 | .204 | .691 (.689) | .179 | .034 | .160 | | | | |
| 23 | .841 | -.504 | .194 | .700 (.696) | .169 | .042 | .198 | | | | |
| Poetry - extending meaning (n = 4) | | | | | | | | .596 | 0.722 | 0.517 | .417 |
| 3 | 1.169 | .468 | .159 | .463 (.470) | .187 | .062 | .248 | | | | |
| 4 | .724 | -1.541 | .211 | .855 (.848) | .110 | .014 | .112 | | | | |
| 9 | .554 | -.042 | .197 | .605 (.605) | .210 | .029 | .122 | | | | |
| 24 | .706 | .698 | .177 | .460 (.463) | .215 | .033 | .134 | | | | |
| Nonfiction - constructing meaning (n = 5) | | | | | | | | .529 | 1.025 | 0.655 | .390 |
| 10 | .795 | -.226 | .194 | .642 (.641) | .187 | .043 | .187 | | | | |
| 11 | .506 | 1.581 | .218 | .404 (.406) | .228 | .012 | .052 | | | | |
| 16 | .809 | -.154 | .192 | .627 (.626) | .190 | .044 | .190 | | | | |
| 17 | .499 | 2.076 | .220 | .358 (.362) | .223 | .007 | .030 | | | | |
| 18 | .839 | .075 | .261 | .616 (.622) | .198 | .039 | .164 | | | | |
| Nonfiction - extending meaning (n = 5) | | | | | | | | .475 | 0.952 | 0.520 | .353 |
| 12 | .709 | 2.238 | .190 | .269 (.276) | .194 | .002 | .013 | | | | |
| 13 | .863 | -.727 | .221 | .753 (.748) | .151 | .035 | .189 | | | | |
| 14 | .686 | .375 | .215 | .541 (.545) | .215 | .034 | .136 | | | | |
| 15 | .795 | .219 | .180 | .546 (.547) | .203 | .044 | .179 | | | | |
| 19 | .812 | 1.874 | .170 | .268 (.276) | .188 | .008 | .041 | | | | |
| Total (n = 24) | | | | | | | | .585 | 4.471 | 16.520 | .789 |

$^a$ Observed item score (proportion correct responses) for the real data (N = 4,854).